Exploration and Visualization

StartR Workshop

Maik Bieleke, PhD

University of Konstanz

November 24, 2024

Data Exploration

The data viewer

RStudio has a built-in data viewer. You can open it by clicking on the dataset in the Environment pane or by applying the View() function to the dataset. It allows you to

  • sort columns
  • filter rows
  • search for text

Overview with str()

A good overview of the structure of the dataset can be obtained with the base R function str().

str(fifa)
'data.frame':   17660 obs. of  13 variables:
 $ ID         : num  209658 212198 224334 192985 224232 ...
 $ Name       : chr  "L. Goretzka" "Bruno Fernandes" "M. Acuña" "K. De Bruyne" ...
 $ Age        : num  27 27 30 31 25 27 30 32 28 28 ...
 $ Nationality: chr  "Germany" "Portugal" "Argentina" "Belgium" ...
 $ Club       : chr  "FC Bayern München" "Manchester United" "Sevilla FC" "Manchester City" ...
 $ Reputation : num  4 3 2 4 3 4 4 3 3 3 ...
 $ Height     : chr  "189" "179" "172" "181" ...
 $ Weight     : chr  "82" "69" "69" "70" ...
 $ Overall    : num  87 86 85 91 86 89 86 83 82 88 ...
 $ Potential  : num  88 87 85 91 89 90 86 83 82 88 ...
 $ Value      : num  9.10e+07 7.85e+07 4.65e+07 1.08e+08 8.95e+07 ...
 $ Wage       : num  115000 190000 46000 350000 110000 130000 220000 61000 63000 250000 ...
 $ Foot       : chr  "Right" "Right" "Left" "Right" ...

Overview with glimpse()

An alternative is the dplyr::glimpse() function.

dplyr::glimpse(fifa)
Rows: 17,660
Columns: 13
$ ID          <dbl> 209658, 212198, 224334, 192985, 224232, 212622, 197445, 18…
$ Name        <chr> "L. Goretzka", "Bruno Fernandes", "M. Acuña", "K. De Bruyn…
$ Age         <dbl> 27, 27, 30, 31, 25, 27, 30, 32, 28, 28, 26, 36, 27, 26, 27…
$ Nationality <chr> "Germany", "Portugal", "Argentina", "Belgium", "Italy", "G…
$ Club        <chr> "FC Bayern München", "Manchester United", "Sevilla FC", "M…
$ Reputation  <dbl> 4, 3, 2, 4, 3, 4, 4, 3, 3, 3, 2, 4, 3, 1, 3, 1, 3, 3, 4, 3…
$ Height      <chr> "189", "179", "172", "181", "172", "177", "180", "183", "1…
$ Weight      <chr> "82", "69", "69", "70", "68", "75", "78", "80", "86", "74"…
$ Overall     <dbl> 87, 86, 85, 91, 86, 89, 86, 83, 82, 88, 84, 88, 86, 83, 84…
$ Potential   <dbl> 88, 87, 85, 91, 89, 90, 86, 83, 82, 88, 87, 88, 87, 86, 85…
$ Value       <dbl> 91000000, 78500000, 46500000, 107500000, 89500000, 1055000…
$ Wage        <dbl> 115000, 190000, 46000, 350000, 110000, 130000, 220000, 610…
$ Foot        <chr> "Right", "Right", "Left", "Right", "Right", "Right", "Left…
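Both overviews list Height and Weight as character columns although they contain numbers, whereas the skim() summary later in this deck reports them as numeric. The conversion step is not shown, so the following is only a minimal sketch of one way to do it with as.numeric(). The data frame toy is a hypothetical stand-in that reuses the first values from the str() output.

```r
# Hypothetical stand-in for fifa, using the first values from the str() output
toy <- data.frame(
  Height = c("189", "179", "172", "181"),
  Weight = c("82", "69", "69", "70")
)

# Convert the character columns to numeric
toy$Height <- as.numeric(toy$Height)
toy$Weight <- as.numeric(toy$Weight)

# Both columns are now reported as num
str(toy)
```

Values that cannot be parsed as numbers become NA (with a warning), so it is worth checking for new missing values after converting.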

Dataset properties

  • number of rows

    nrow(fifa)
    [1] 17660
  • number of columns

    ncol(fifa)
    [1] 13
  • dimensions (rows x columns)

    dim(fifa)
    [1] 17660    13
  • names of the variables

    names(fifa)
     [1] "ID"          "Name"        "Age"         "Nationality" "Club"       
     [6] "Reputation"  "Height"      "Weight"      "Overall"     "Potential"  
    [11] "Value"       "Wage"        "Foot"       

First and last rows with head() and tail()

  • show the first 3 rows

    head(fifa, 3) 
          ID            Name Age Nationality              Club Reputation Height
    1 209658     L. Goretzka  27     Germany FC Bayern München          4    189
    2 212198 Bruno Fernandes  27    Portugal Manchester United          3    179
    3 224334        M. Acuña  30   Argentina        Sevilla FC          2    172
      Weight Overall Potential    Value   Wage  Foot
    1     82      87        88 91000000 115000 Right
    2     69      86        87 78500000 190000 Right
    3     69      85        85 46500000  46000  Left
  • show the last 3 rows

    tail(fifa, 3)
              ID            Name Age Nationality           Club Reputation Height
    17658 270567        A. Demir  25      Turkey   Ümraniyespor          1    190
    17659 256624    21 S. Czajor  18      Poland Fleetwood Town          1    187
    17660 256376 21 F. Jakobsson  20      Sweden IFK Norrköping          1    186
          Weight Overall Potential Value Wage  Foot
    17658     82      51        56 70000 2000 Right
    17659     79      50        65 90000  500 Right
    17660     78      50        61 90000  500  Left

Overview of the dataset with skimr::skim()

skimr::skim(fifa)
── Data Summary ────────────────────────
                           Values
Name                       fifa  
Number of rows             17660 
Number of columns          13    
_______________________          
Column type frequency:           
  character                4     
  numeric                  9     
________________________         
Group variables            None  

── Variable type: character ─────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 Name                  0             1   3  25     0    17140          0
2 Nationality           0             1   4  24     0      161          0
3 Club                  0             1   0  45   211      927          0
4 Foot                  0             1   4   5     0        2          0

── Variable type: numeric ───────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate       mean          sd  p0     p25    p50      p75      p100 hist 
1 ID                    0             1  246319.     31488.     16 240732. 257041  263028.    271340 ▁▁▁▁▇
2 Age                   0             1      23.1        4.64   15     20      22      26         54 ▇▅▁▁▁
3 Height                0             1     181.         6.96  149    176     181     186        206 ▁▁▇▅▁
4 Weight                0             1      74.3        6.98   48     70      74      79        110 ▁▆▇▁▁
5 Overall               0             1      63.4        8.04   43     58      63      69         91 ▂▇▆▃▁
6 Potential             0             1      71.0        6.53   42     67      71      75         95 ▁▂▇▅▁
7 Value                 0             1 2739788.   7841276.      0 325000  700000 1725000  190500000 ▇▁▁▁▁
8 Wage                  0             1    8190.     20477.      0    550    2000    6000     450000 ▇▁▁▁▁
9 Reputation            0             1       1.11       0.407   1      1       1       1          5 ▇▁▁▁▁

Plotting in R

There are many ways to create plots in R.

  • Base R graphics
  • lattice graphics
  • ggplot2 graphics
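As a rough comparison, the same scatterplot can be drawn in all three systems. This sketch uses the built-in mtcars dataset rather than the workshop data and assumes that the lattice and ggplot2 packages are installed; the object names are chosen for illustration.

```r
# Base R graphics: draws directly on the graphics device
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")

# lattice graphics: formula interface, returns a "trellis" object
p_lattice <- lattice::xyplot(mpg ~ wt, data = mtcars)
print(p_lattice)

# ggplot2 graphics: data, aesthetic mapping, and geom are combined with +
p_ggplot <- ggplot2::ggplot(mtcars, ggplot2::aes(x = wt, y = mpg)) +
  ggplot2::geom_point()
print(p_ggplot)
```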

The ggplot2 Package

Grammar of Graphics (Leland Wilkinson)

ggplot2 (Hadley Wickham)

Installing and loading ggplot2

We need to install the ggplot2 package once.

install.packages("ggplot2") # if not already installed

Now we can load the package into our current R session.

library(ggplot2)
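When a script is shared with others, the installation step is often wrapped in a check so that the package is only installed when it is missing. A minimal sketch of this pattern (a common convention, not part of the workshop code) using requireNamespace():

```r
# Install ggplot2 only if it is not available yet
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

# Load the package for the current session
library(ggplot2)
```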

The ggplot() function

Every plot is initialized with the ggplot() function. It has two main arguments:

  • data specifies the data frame to be used
  • mapping specifies how variables are mapped to aesthetics, usually via aes()

Aesthetics are the visual properties (e.g., position, color, size) of the geometric objects (geoms) drawn in the plot.

Basic plotting

Data

We first specify the data frame to be used for plotting. This provides variables (columns) and observations (rows).

ggplot(data = fifa)

Global aesthetics

We map variables of the data frame to visual properties (aesthetics) of the plot. Aesthetics set in ggplot() are global. Here, Overall and Wage are mapped to the x and y positions; this sets up the axes, but nothing is drawn yet.

ggplot(data = fifa, mapping = aes(x = Overall, y = Wage))

Geometric objects

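We add geometric objects (geoms) with + and geom_*(). Here, geom_point() draws one point per player at its x and y position. The argument names data and mapping can be omitted when the arguments are given in this order.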

ggplot(fifa, aes(x = Overall, y = Wage)) + geom_point()
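
Because layers are added with +, a plot can also be stored in an object and extended step by step. A minimal sketch: toy is a hypothetical stand-in for fifa built from the rows shown by head() and tail(), and the object names are chosen for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for fifa (values taken from the head()/tail() output)
toy <- data.frame(
  Overall = c(87, 86, 85, 51, 50, 50),
  Wage    = c(115000, 190000, 46000, 2000, 500, 500),
  Foot    = c("Right", "Right", "Left", "Right", "Right", "Left")
)

# Data and aesthetics only: the axes are set up, but there is no layer yet
p_base <- ggplot(toy, aes(x = Overall, y = Wage))
length(p_base$layers)

# Adding a geom returns a new plot object that contains one layer
p_points <- p_base + geom_point()
length(p_points$layers)

# Printing the object draws the plot
print(p_points)
```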

More geoms and aesthetics

Adding geoms

Geometric layers can be stacked to create more complex plots. Just add more + geom_*() calls.

ggplot(fifa, aes(x = Overall, y = Wage)) +
  geom_point() + geom_smooth()

Adding global aesthetics

Additional global aesthetics can be added to the ggplot() function. They will be applied to all geoms.

ggplot(fifa, aes(x = Overall, y = Wage, color = Foot)) +
  geom_point() +
  geom_smooth()

Adding local aesthetics

Aesthetics can also be added locally to affect only a single geom.

ggplot(fifa, aes(x = Overall, y = Wage)) +
  geom_point() +
  geom_smooth(aes(color = Foot))
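
A related distinction is between mapping an aesthetic to a variable inside aes() and setting it to a fixed value outside aes(). The sketch below contrasts the two. Here toy is a hypothetical stand-in for fifa built from the rows shown by head() and tail(), and "steelblue" is an arbitrary color choice.

```r
library(ggplot2)

# Hypothetical stand-in for fifa (values taken from the head()/tail() output)
toy <- data.frame(
  Overall = c(87, 86, 85, 51, 50, 50),
  Wage    = c(115000, 190000, 46000, 2000, 500, 500),
  Foot    = c("Right", "Right", "Left", "Right", "Right", "Left")
)

# Mapping: the color depends on the variable Foot, and a legend is added
p_mapped <- ggplot(toy, aes(x = Overall, y = Wage)) +
  geom_point(aes(color = Foot))

# Setting: all points get the same fixed color, and no legend is needed
p_fixed <- ggplot(toy, aes(x = Overall, y = Wage)) +
  geom_point(color = "steelblue")

print(p_mapped)
print(p_fixed)
```

Writing aes(color = "steelblue") instead would treat the string as a variable with a single value, so the points would not be drawn in steelblue.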

Common types of plots

Histograms and density plots

ggplot(fifa, aes(x = Overall)) + 
  geom_histogram()

ggplot(fifa, aes(x = Overall)) + 
  geom_density()

ggplot(fifa, aes(x = Overall)) + 
  geom_histogram(color = "white")

ggplot(fifa, aes(x = Overall)) + 
  geom_density(aes(color = Foot))
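
Unless told otherwise, geom_histogram() uses 30 bins and prints a message suggesting that a better value be chosen. The binning can be controlled with the binwidth or bins arguments. In this sketch, toy is a hypothetical stand-in for fifa, and the bin settings are illustrative assumptions (Overall appears to contain whole-number scores, so a bin width of 1 gives one bar per score).

```r
library(ggplot2)

# Hypothetical stand-in for fifa (values taken from the head()/tail() output)
toy <- data.frame(Overall = c(87, 86, 85, 51, 50, 50))

# Fix the width of each bin: one bar per score value
p_binwidth <- ggplot(toy, aes(x = Overall)) +
  geom_histogram(binwidth = 1, color = "white")

# Alternatively, fix the number of bins
p_bins <- ggplot(toy, aes(x = Overall)) +
  geom_histogram(bins = 10, color = "white")

print(p_binwidth)
print(p_bins)
```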

Boxplots and violin plots

ggplot(fifa, aes(x = Overall)) + 
  geom_boxplot()

ggplot(fifa, aes(x = Foot, y = Overall)) + 
  geom_violin()

ggplot(fifa, aes(x = Overall)) + 
  geom_boxplot(aes(fill = Foot))

ggplot(fifa, aes(x = Foot, y = Overall)) + 
  geom_violin() + geom_boxplot(width = 0.1) 

Barplots

ggplot(fifa, aes(x = Foot)) + 
  geom_bar()

ggplot(fifa, aes(x = Reputation, 
                 fill = Foot)) + 
  geom_bar(position = "stack")

ggplot(fifa, aes(x = Foot)) + 
  geom_bar(aes(fill = Foot))

ggplot(fifa, aes(x = Reputation, 
                 fill = Foot)) + 
  geom_bar(position = "dodge")
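
geom_bar() counts the rows in each category itself, so the bar heights correspond to what table() returns. A small check of this, using a hypothetical stand-in toy for fifa and the ggplot2 helper layer_data() to look at the computed values:

```r
library(ggplot2)

# Hypothetical stand-in for fifa (values taken from the head()/tail() output)
toy <- data.frame(
  Foot = c("Right", "Right", "Left", "Right", "Right", "Left")
)

# Counts per category, computed directly
table(toy$Foot)

# geom_bar() computes the same counts before drawing the bars
p_bar <- ggplot(toy, aes(x = Foot)) +
  geom_bar()
layer_data(p_bar)$count
```

If the counts are already available as a column, geom_col() draws bars from a y variable instead of counting rows.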

Fine-tuning plots

Facets

Facets split the plot into multiple panels, one for each level of a categorical variable.

# Create a plot for each foot
ggplot(fifa, aes(x = Overall, y = Wage)) + 
  geom_point() + geom_smooth() +
  facet_wrap(~ Foot)

Scales

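Scales control how data values are translated into aesthetics (e.g., which color each foot gets). They can be changed manually with the scale_*() functions; axis, title, and legend labels are set with labs().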

# Set the point colors manually and relabel the axes, title, and legend
ggplot(fifa, aes(x = Overall, y = Wage, color = Foot)) + 
  geom_point() + 
  scale_color_manual(values = c("blue", "red")) +
  labs(x = "Ability Score", y = "Wage in Euro",
       title = "Relationship between Ability and Wage",
       color = "Preferred Foot")
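
With an unnamed vector, scale_color_manual() assigns the colors in the order of the levels (alphabetical for character variables, so "Left" comes before "Right"). A named vector makes the assignment explicit. A minimal sketch, with toy as a hypothetical stand-in for fifa:

```r
library(ggplot2)

# Hypothetical stand-in for fifa (values taken from the head()/tail() output)
toy <- data.frame(
  Overall = c(87, 86, 85, 51, 50, 50),
  Wage    = c(115000, 190000, 46000, 2000, 500, 500),
  Foot    = c("Right", "Right", "Left", "Right", "Right", "Left")
)

# Names tie each color to a level, independent of the level order
p_named <- ggplot(toy, aes(x = Overall, y = Wage, color = Foot)) +
  geom_point() +
  scale_color_manual(values = c(Left = "blue", Right = "red"))

print(p_named)
```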

Theme elements

Theme elements control the non-data components of the plot. They can be changed with the theme() function.

# Change the position of the legend from the right to the bottom
ggplot(fifa, aes(x = Overall, y = Wage, color = Foot)) + 
  geom_point() + 
  theme(legend.position = "bottom")

Themes

Complete themes change the overall appearance of the plot. They are applied with the theme_*() functions.

# Use the black and white (bw) theme
ggplot(fifa, aes(x = Overall, y = Wage, color = Foot)) + 
  geom_point() + 
  theme_bw()
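
Complete themes and theme() adjustments can be combined, but the order matters: a theme_*() call replaces all theme settings, so individual theme() tweaks should come after it. A minimal sketch, with toy as a hypothetical stand-in for fifa and illustrative object names:

```r
library(ggplot2)

# Hypothetical stand-in for fifa (values taken from the head()/tail() output)
toy <- data.frame(
  Overall = c(87, 86, 85, 51, 50, 50),
  Wage    = c(115000, 190000, 46000, 2000, 500, 500),
  Foot    = c("Right", "Right", "Left", "Right", "Right", "Left")
)

p <- ggplot(toy, aes(x = Overall, y = Wage, color = Foot)) +
  geom_point()

# theme() after theme_bw(): the legend moves to the bottom
p_ok <- p + theme_bw() + theme(legend.position = "bottom")

# theme() before theme_bw(): the complete theme resets the legend position
p_overridden <- p + theme(legend.position = "bottom") + theme_bw()

print(p_ok)
print(p_overridden)
```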

Saving plots

The ggsave() function

The ggsave() function saves plots to a file. By default, the last plot is saved into your current working directory.

  • The filename argument specifies the file name. The file type is determined automatically from the file extension (e.g., .png, .jpg).
  • The optional path argument specifies the directory in which the file is saved.
  • The optional width and height arguments specify the size of the saved plot (in inches by default).

Example

The following code saves the last plot as a PNG file in the “figures” folder of the current working directory. The file is named “myplot.png”, and the plot is 9 inches wide and 4 inches high.

ggsave(filename = "figures/myplot.png", 
       width = 9, height = 4)

There are several additional options, e.g.

  • plot to specify the plot to be saved
  • dpi to specify the resolution of the plot
  • units to specify the units of the width and height arguments
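
These arguments can be combined as in the following sketch. It stores the plot in an object, passes it explicitly via plot, and gives the size in centimeters. The data frame toy is a hypothetical stand-in for fifa, and the file is written to a temporary folder (an assumption made so that the example does not depend on an existing “figures” folder).

```r
library(ggplot2)

# Hypothetical stand-in for fifa (values taken from the head()/tail() output)
toy <- data.frame(
  Overall = c(87, 86, 85, 51, 50, 50),
  Wage    = c(115000, 190000, 46000, 2000, 500, 500),
  Foot    = c("Right", "Right", "Left", "Right", "Right", "Left")
)

p <- ggplot(toy, aes(x = Overall, y = Wage, color = Foot)) +
  geom_point()

# Temporary location for the output file
out_file <- file.path(tempdir(), "myplot.png")

# Save a specific plot, 16 x 9 cm, at 300 dots per inch
ggsave(filename = out_file, plot = p,
       width = 16, height = 9, units = "cm", dpi = 300)

# Check that the file was written
file.exists(out_file)
```

Passing the plot explicitly avoids relying on whichever plot happened to be displayed last.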